a post-clustering differential expression method
July 5, 2024
What is differential expression?
What are the problems with existing methods?
How does ClusterDE solve these problems?
Does ClusterDE really solve these problems?
What are other variations on ClusterDE that we might consider?
Cell-by-gene matrix:
\[ \begin{bmatrix} Y_{11} & \dots & Y_{1m} \\ \vdots & \ddots & \\ Y_{n1} & & Y_{nm} \end{bmatrix} \]
Ultimate goal: identify cell types.
\[ \begin{bmatrix} Y_{11} & \dots & Y_{1m} & Z_1\\ \vdots & \ddots & & \vdots \\ Y_{n1} & & Y_{nm} & Z_n \end{bmatrix} \]
Assumption: only 2 cell types!
\[ Z_i \in \{0, 1\} \]
Use cell-type marker genes to identify cell types.
a potential cell-type marker
a potential set of cell-type markers
Hypothesis tests give us candidates for cell-type markers.
top DE genes identified by hypothesis tests
Figure S16
However, we used the data twice.
We clustered the cells.
We tested each gene to see how different the expression levels are between the two clusters. However, we already know that this variation is there.
In other words: we forced the genes to exhibit different distributions in the two clusters.
When our clusters are bad, we open ourselves up to false discoveries.
We only consider two cell types, so \(Z \in \{0, 1\}\)
We want to test
\[H_{0j} : \mu_{Z = 0, j} = \mu_{Z = 1, j}\]
But we can only test \[H_{0j}^{DD} : \mu_{\hat{Z} = 0, j} = \mu_{\hat{Z} = 1, j}\]
False discoveries occur when this does NOT hold
\[H_{0j}^{DD} : \mu_{\hat{Z} = 0, j} = \mu_{\hat{Z} = 1, j}\] , but this holds \[H_{0j} : \mu_{Z = 0, j} = \mu_{Z = 1, j}\]
In words: we made the right decision, but we set up the wrong test.
We test in order to discover differentially expressed genes.
False discoveries are genes that are not
\[ \text{FDR} := \# \]
A graphical illustration of ClusterDE.
Intuition: compute differences between the real data and a concrete null hypothesis.
Differences from this synthetic data imply differentially expressed genes.
rmvnb()?Assumption: each gene follows a negative binomial distribution.
\[ Y_j \sim NB(\mu_j, \sigma_j^2) \]
But our data doesn’t look like this:
rmvnb()Null generation steps.
Theorem (Probability Integral Transform): \(F_X(X) \sim \text{Uniform}(0, 1)\).
Proof: the standard proof of the PIT found on the Wikipedia page.
\[ \begin{align} F_Y(y) &= P(Y \leq y)\\ &= P(F_X(X) \leq y) && \text{(substituted the definition of } Y)\\ &= P(X \leq F_X^{-1}(y)) && \text{(applied } F_X^{-1} \text{ to both sides)}\\ &= F_X(F_X^{-1}(y)) && \text{(the definition of a CDF)}\\ &= y \end{align} \]
We can transform a random variable into a different random variable using their CDFs.
\[U = F_X(X) \sim \text{Uniform}(0, 1)\]
\[X = F_X^{-1}(U)\]
Theorem (Sklar’s Theorem): Let \(\mathbf{X}\) be a \(m\)-dimensional random vector with joint cumulative distribution function \(F\) and marginal distribution functions \(F_j, j = 1, ..., m\). The joint CDF can be expressed as
\[ F(x_1, ..., x_m) = C(F_1(x_1), ..., F_m(x_m)) \]
with associated probability density (or mass) function
\[ f(x_1, ..., x_m) = c(F_1(x_1), ..., F_m(x_m)) f_1(x_1) ... f_m(x_m) \]
for a \(m\)-dimensional copula \(C\) with copula density \(c\).
The inverse also holds: the copula corresponding to a multivariate CDF \(F\) with marginal distribution functions \(F_j, j = 1, ..., m\) can be expressed as
\[ C(u_1, ..., u_m) = F(F_1^{-1}(u_1), ..., F_m^{-1}(u_m)) \] , and the copula density (or mass) function is
\[ c(u_1,...,u_m) = \frac{f(F_1^{-1}(u_1), ..., F_m^{-1}(u_m))} {f_1(F_1^{-1}(u_1)) ... f_m(F_m^{-1}(u_m))} \] .
Sklar’s theorem allows us to separately model the individual variables and their dependence.
For convenience, we will choose the Gaussian copula.
\[ C(\mathbf{u}; \mathbf{R}) = \Phi_{\mathbf{R}}(\Phi^{-1}(u_1), ..., \Phi^{-1}(u_m)) \]
Now, our goal is to estimate:
the marginal parameters \(\{\mu_j, \sigma_j\}_{j = 1}^m\)
the covariance matrix\(\mathbf{R}\)
For each gene \(j\), estimate \(\{\mu_j, \sigma_j\}\) using maximum likelihood. These are the marginal distributions.
Separately, use the Gaussian copula to model the dependence structure.
rmvnb() using the Gaussian copulaSeparately, use the Gaussian copula to model the dependence structure.
\[ \begin{bmatrix} \tilde{X}_{11} & \dots & \tilde{X}_{1m} \\ \vdots & \ddots & \\ \tilde{X}_{n1} & & \tilde{X}_{nm} \end{bmatrix} \]
\[ \begin{bmatrix} \hat{F}_1^{-1}(\Phi(\tilde{X}_{11})) & \dots & \hat{F}_m^{-1}(\Phi(\tilde{X}_{1m})) \\ \vdots & \ddots & \\ \hat{F}_1^{-1}(\Phi(\tilde{X}_{n1})) & & \hat{F}_m^{-1}(\Phi(\tilde{X}_{nm})) \end{bmatrix} = \begin{bmatrix} \tilde{Y}_{11} & \dots & \tilde{Y}_{1m} \\ \vdots & \ddots & \\ \tilde{Y}_{n1} & & \tilde{Y}_{nm} \end{bmatrix} \]
Null generation steps.
Clustering step
Hypothesis testing step
TODO: explain what Clipper is.
TODO: add a reminder of the FDR problem.
Given the target and null DE scores, compute a contrast score for gene \(j\) as \(C_j := S_j - \tilde{S}_j\).
FDR control.
\[ t* = \min\left\{t \in \{|C_j|: C_j \neq 0\}: \frac{1+\#\{j:C_j \leq -t\}}{\#\{j:C_j \geq t\} \lor 1} \leq q\right\} \]
Simulations
Real data
homogeneous cell line (e.g. H2228): no DE genes
different cell types: biologically meaningful DE genes
H2228: “the gold standard for benchmarking the accuracy of clustering.”
One true cluster
Successfully identified no DE genes
Choose 2 from this large set
Chose 2 that we know are similar
some top DE genes have similar expression levels
identified more biologically meaningful DE genes
Knockoffs
Change setting to prediction of cell cluster label
Generate features that are uncorrelated with the response
Permutation:
Power and FDR comparisons
Discovering DE genes helps us generate cell-type markers, which identify the cell types in our sample.
Naive discovery methods (e.g. Seurat) double-dip: they cluster the data, then perform hypothesis tests using those clusters. This opens up an increased risk of false discoveries.
ClusterDE outperforms other methods when - there is only one cell type, by identifying few DE genes - there are distinct cell types, by identifying more meaningful DE genes
The copula generation strategy outperforms other alternatives in terms of FDR and power.
Comparison of synthetic null generation strategies.